This survey studies the basic concepts of graph mining and social networks. Social networking sites such as
Facebook, LinkedIn, and Google+ are now widely used. The users of these sites form a social network, which provides a
powerful means of sharing, organizing, and finding content and contacts. A social network can be cast as a graph in
which users are represented as "nodes" and their relationships as "links". This representation allows us to
characterize and analyze the network. We also present some challenges in crawling.

Social Network Analysis (SNA) is the study of relations between individuals, including the analysis of social
structures, social position, role analysis, and many others [29]. These relations are represented as a graph, with
nodes representing individuals or groups and edges representing the relations between them.

With the prosperity of the Internet and Web 2.0, many social networking and social media sites are emerging, and
people can easily connect to each other in cyberspace. This also extends SNA to a much larger scale, with millions of
actors or even more in a network. Examples include email communication networks [1], instant messenger networks [2],
and mobile call networks [3]. Other forms of complex networks, such as biological networks, metabolic pathways,
genetic regulatory networks, food webs, and neural networks, have also been examined and demonstrate similar
patterns [4]. As a result, the analysis of social networks has recently experienced a surge of interest among
researchers. Online social network (OSN) data analysis has great potential for researchers in a diversity of
disciplines. However, we propose that OSN analysis should be placed in the context of its sociological origins and its
basis in graph theory.

Graph mining of online social networks is a relatively new area of research that nevertheless has a solid base in
classic graph theory, computational cost considerations, and sociological concepts such as how individuals
interrelate, group together, and follow one another [5]. The structure of this paper is as follows: Section 2 gives a
brief introduction to graphs and social networks. Section 3 gives an introduction to graph mining. Sections 4 and 5
present challenges in crawling social networks and crawling in social networks, respectively.


A graph G is represented as G(V, E), where V is a set of vertices (or nodes) and E is a set of edges (or links)
connecting some vertex pairs in V. Statistically, a graph can be characterized by derived values such as the average
degree of the nodes and the average path length between nodes. Additional characteristics are the graph's diameter,
the number of triangles, the number of isomorphisms, and the clustering coefficient, among others.




Figure  1:  Simple  Graph  with  Five  Vertices  and  Five  Edges 

In Figure 1 we see an elementary graph with five vertices and five edges. As there are no arrows, we assume it is
undirected, and as the edges have no additional information attached, we assume it is unweighted. Nodes A, B, and D
have degree 2, node C has degree 3, and node E has degree 1; hence the degree sequence is {1, 2, 2, 2, 3}.
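The degree sequence above can be checked with a few lines of Python. This is a minimal sketch: the drawing of Figure 1 is not reproduced here, so the edge list `EDGES` is an assumed layout chosen only to match the stated degrees, and the function names are illustrative.

```python
from collections import Counter

# Assumed edge list: one undirected, unweighted graph with five vertices and
# five edges whose degrees match the text (A, B, D: 2; C: 3; E: 1).
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")]

def degrees(edges):
    """Count how many edges are incident to each vertex."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def degree_sequence(edges):
    """Vertex degrees in non-decreasing order."""
    return sorted(degrees(edges).values())

def average_degree(edges):
    """Sum of degrees divided by the number of vertices (2|E| / |V|)."""
    deg = degrees(edges)
    return sum(deg.values()) / len(deg)

print(degree_sequence(EDGES))
print(average_degree(EDGES))
```

Because every edge contributes to two vertex degrees, the degrees sum to twice the number of edges, which is a quick consistency check on any reported degree sequence.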

There are many types of graphs: directed, undirected, and graphs with weights on the edges, vertices, or both.
An 'undirected graph' has no information about the direction or flow between nodes. That is, the edge between two
vertices A and B is identical to the edge between vertices B and A. A 'directed graph', on the other hand, does include
directional information. Each edge has a direction associated with it, which can be unidirectional (A -> B) or
bidirectional (A <-> B). A 'weighted graph' includes additional information associated with an edge or a vertex.

Graphs Are Used in the Following Areas

•  Internet/Computer  Networks 

•  WWW 

•  Social  Networks 

•  Transport  Networks 

•  Many  more  . . . 
2.2  Social  Network 

A social network is a social structure made up of a set of individuals (or organizations, etc.) tied together by
links. A link can be undirected or directed. Each individual is called an actor, and a link represents the relationship
between two actors. Both links and nodes can have attributes.


Figure 2: Science Co-Author Graph

2.2.1 Characteristics of Network

Almost  all  large  real-world  networks  evolve  over  time  by  the  addition  and  deletion  of  nodes  and  edges.  Most  of 
the  recent  models  of  network  evolution  capture  the  growth  process  in  a  way  that  incorporates  two  pieces  of  "conventional 
wisdom:" 

•  Constant  Average  Degree  Assumption:  The  average  node  degree  in  the  network  remains  constant  over  time. 
(Or  equivalently,  the  number  of  edges  grows  linearly  in  the  number  of  nodes.) 

• Slowly Growing Diameter Assumption: The diameter is a slowly growing function of the network size, as in
"small world" graphs [ ].

However, the following phenomena are observed on some datasets:

• Empirical Observation: Densification Power Laws. The networks are becoming denser over time, with the
average degree increasing (and hence with the number of edges growing superlinearly in the number of nodes).
Moreover, the densification follows a power-law pattern:

$e(t) \propto n(t)^{a}$

where e(t) is the number of edges and n(t) is the number of nodes at time t, and a is an exponent that lies strictly
between 1 and 2.

• Empirical Observation: Shrinking Diameters. The effective diameter is, in many cases, actually decreasing as
the network grows.

We view the second of these findings as particularly surprising: rather than settling the debate over exactly how the
graph diameter grows as a function of the number of nodes, it indicates that the effective diameter can shrink as the
network grows.

Densification  Law 

Here  we  describe  the  datasets  we  used,  and  our  findings  related  to  densification.  For  each  graph  dataset,  we  have, 
or  can  generate,  several  time  snapshots,  for  which  we  study  the  number  of  nodes  n(t)  and  the  number  of  edges  e(t)  at  each 
timestamp t. We denote by n and e the final number of nodes and edges. We use the term Densification Power Law plot
(or just DPL plot) to refer to the log-log plot of the number of edges e(t) versus the number of nodes n(t).


Figure 3: (a) Arxiv (b) Patent

Figure 3(a) shows the DPL plot; the slope is a = 1.68 and corresponds to the exponent in the densification law.
Notice that a is significantly higher than 1, indicating a large deviation from linear growth.

Figure 3(b) shows a similar pattern to Figure 3(a), with a = 1.66.
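The slope of a DPL plot can be estimated with an ordinary least-squares fit of log e(t) against log n(t). The sketch below is illustrative only: the snapshot values are synthetic (generated from an assumed exponent of 1.5), not the Arxiv or Patent data, and the function name `densification_exponent` is hypothetical.

```python
import math

def densification_exponent(nodes, edges):
    """Least-squares slope of log(edges) versus log(nodes), i.e. the exponent a."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic snapshots n(t), e(t): edges grow as 0.5 * n^1.5 (assumed for the demo).
NODES = [100, 200, 400, 800, 1600]
EDGES = [round(0.5 * n ** 1.5) for n in NODES]

a = densification_exponent(NODES, EDGES)
print(f"estimated exponent a = {a:.3f}")
```

An estimate close to 1 would correspond to the constant average degree assumption, while a value clearly above 1 indicates densification.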

Shrinking  Diameter 

The effective diameter is defined as in earlier work: the minimum value d such that at least 90% of the connected
node pairs are at distance at most d. The effective diameter is a more robust quantity than the diameter (defined as
the maximum distance over all connected node pairs), since the diameter is prone to the effects of degenerate
structures in the graph (e.g. very long chains).
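The difference between the two quantities can be seen on a small example. The following sketch computes both by breadth-first search on an undirected graph stored as an adjacency dictionary. The example graph (a 20-node clique with a short chain attached) and all function names are assumptions made for illustration, and the all-pairs computation is only practical for small graphs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pair_distances(adj):
    """Sorted distances over all connected unordered node pairs."""
    dists = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s < t:  # count each unordered pair once
                dists.append(d)
    return sorted(dists)

def diameter(adj):
    """Maximum distance over all connected node pairs."""
    return pair_distances(adj)[-1]

def effective_diameter(adj, percent=90):
    """Smallest d such that at least `percent`% of connected pairs are within d."""
    dists = pair_distances(adj)
    needed = (percent * len(dists) + 99) // 100  # integer ceiling
    return dists[needed - 1]

def build_example():
    """A 20-node clique (nodes 0..19) with the chain 19-20-21 attached."""
    adj = {v: set() for v in range(22)}
    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)
    for u in range(20):
        for v in range(u + 1, 20):
            link(u, v)
    link(19, 20)
    link(20, 21)
    return adj

GRAPH = build_example()
print(diameter(GRAPH), effective_diameter(GRAPH))
```

In this example the chain stretches the diameter, while the effective diameter stays smaller because only a small share of the node pairs involve the far end of the chain.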


Figure 4: (a) Arxiv Citation Graph (b) Patents (horizontal axis: time in years)

Figure  4  shows  the  effective  diameter  over  time;  one  observes  a  decreasing  trend  for  all  the  graphs. 

Basic Characteristics of Social Networks

• Node's Degree - Number of edges incident to a node

• Network Diameter - Maximum distance over all connected pairs of nodes

• Effective Diameter - Minimum distance d such that at least 90% of reachable node pairs are at distance at most d

• Average Diameter - Average node-to-node distance

3. GRAPH MINING

Graph mining can be considered a specialization of data mining, whose objective is to process data that is difficult
for humans to interpret meaningfully and to identify and extract high-value knowledge from it. For example, data
mining applications analyze data with techniques that are, in general, statistical analysis and/or machine learning
techniques from artificial intelligence. Graph mining is thus similar to data mining, except that it is applied to
graphs.

3.1  Graph  Mining  Methods 
3.1.1  Apriori-Based  Approach 

Apriori-based frequent substructure mining algorithms share similar characteristics with Apriori-based frequent
item set mining algorithms. The search for frequent graphs starts with graphs of small "size" and proceeds in a
bottom-up manner by generating candidates having an extra vertex, edge, or path.

The general framework of Apriori-based methods for frequent substructure mining is outlined in the algorithm below.
S_k is the frequent substructure set of size k. At each iteration, the size of newly discovered frequent substructures
is increased by one. These new substructures are first generated by joining two similar but slightly different frequent
subgraphs that were discovered in the previous call to AprioriGraph. The frequency of the newly formed graphs is then
checked. Those found to be frequent are used to generate larger candidates in the next round.

However,  the  candidate  generation  problem  in  frequent  substructure  mining  is  harder  than  that  in  frequent  item  set 
mining,  because  there  are  many  ways  to  join  two  substructures. 

Algorithm

Input:

D, a graph data set;

min_sup, the minimum support threshold

Output:

S_k, the frequent substructure set

S_1 ← frequent single-elements in the data set;
Call AprioriGraph(D, min_sup, S_1);
Procedure AprioriGraph(D, min_sup, S_k)

• S_{k+1} ← ∅;

• for each frequent g_i ∈ S_k do

•     for each frequent g_j ∈ S_k do

•         for each size (k+1) graph g formed by the merge of g_i and g_j do

•             if g is frequent in D and g ∉ S_{k+1} then

•                 insert g into S_{k+1};

• if S_{k+1} ≠ ∅ then

•     AprioriGraph(D, min_sup, S_{k+1});

• return;
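The level-wise procedure above can be sketched in Python under a strong simplifying assumption: every vertex label is unique across the data set, so a graph is identified by its set of labeled edges and "g is frequent in D" reduces to a subset test. This sidesteps subgraph isomorphism, which a real frequent-subgraph miner must handle. The helper names (`edge`, `graph`, `support`, `apriori_graph`) and the toy data set are illustrative, and the recursion is written as a loop.

```python
from itertools import combinations

def edge(u, v):
    """Canonical undirected edge: a sorted pair of vertex labels."""
    return tuple(sorted((u, v)))

def graph(*pairs):
    """A graph is modeled as a frozenset of canonical edges."""
    return frozenset(edge(u, v) for u, v in pairs)

def support(pattern, dataset):
    """Number of data-set graphs that contain every edge of the pattern."""
    return sum(1 for g in dataset if pattern <= g)

def is_connected(edges):
    """True if the edges form a single connected subgraph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adj)

def apriori_graph(dataset, min_sup):
    """Level-wise search over S_1, S_2, ... until no frequent candidate remains."""
    all_edges = set().union(*dataset)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e]), dataset) >= min_sup}  # S_1
    frequent = []
    while level:  # this loop replaces the recursive call
        frequent.extend(level)
        next_level = set()  # S_{k+1}
        for gi, gj in combinations(level, 2):
            g = gi | gj  # merge of g_i and g_j
            if len(g) != len(gi) + 1 or g in next_level:
                continue  # not of size k+1, or already inserted
            if is_connected(g) and support(g, dataset) >= min_sup:
                next_level.add(g)
        level = next_level
    return frequent

DATASET = [
    graph(("A", "B"), ("B", "C"), ("C", "D")),
    graph(("A", "B"), ("B", "C")),
    graph(("B", "C"), ("C", "D")),
]
FREQUENT = apriori_graph(DATASET, min_sup=2)
print(len(FREQUENT))
```

Here "size" is measured in edges, and two size-k patterns are merged only when they differ in exactly one edge, which mirrors the join of "two similar but slightly different" subgraphs described in the text.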

3.1.2  Pattern-Growth  Approach 

The Apriori-based approach has to use the breadth-first search (BFS) strategy because of its level-wise candidate
generation. In contrast, the pattern-growth approach is more flexible regarding its search method. It can use
breadth-first search as well as depth-first search (DFS), the latter of which consumes less memory. A graph g can be
extended by adding a new edge e. The newly formed graph is denoted by g ◇_x e. Edge e may or may not introduce a new
vertex to g. If e introduces a new vertex, we denote the new graph by g ◇_xf e; otherwise, g ◇_xb e, where f or b
indicates that the extension is in a forward or backward direction.

A general framework for pattern-growth frequent substructure mining is illustrated below. It is simple but not
efficient, because the same graph can be discovered many times.

Algorithm

Input:

g, a frequent graph;

D, a graph data set;

min_sup, the minimum support threshold

Output:

The frequent graph set, S

S ← ∅;
Call PatternGrowthGraph(g, D, min_sup, S);
Procedure PatternGrowthGraph(g, D, min_sup, S)

• if g ∈ S then return;

• else insert g into S;

• scan D once, find all the edges e such that g can be extended to g ◇_x e;

• for each frequent g ◇_x e do

•     PatternGrowthGraph(g ◇_x e, D, min_sup, S);

• return;
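The recursive procedure above can be sketched in Python under the same kind of simplification: vertex labels are assumed to be unique, so a graph is a set of labeled edges and frequency counting is a subset test rather than a subgraph-isomorphism test. All names (`pattern_growth`, `extensions`, `mine`) and the toy data set are illustrative assumptions.

```python
def edge(u, v):
    """Canonical undirected edge: a sorted pair of vertex labels."""
    return tuple(sorted((u, v)))

def graph(*pairs):
    """A graph is modeled as a frozenset of canonical edges."""
    return frozenset(edge(u, v) for u, v in pairs)

def support(pattern, dataset):
    """Number of data-set graphs that contain every edge of the pattern."""
    return sum(1 for g in dataset if pattern <= g)

def extensions(pattern, dataset):
    """Scan the data set once for edges that can extend the pattern.

    An edge qualifies if it touches a vertex of the pattern; it is a forward
    extension when it adds a new vertex and a backward one otherwise.
    """
    vertices = {v for e in pattern for v in e}
    found = set()
    for g in dataset:
        if pattern <= g:
            for e in g - pattern:
                if e[0] in vertices or e[1] in vertices:
                    found.add(e)
    return found

def pattern_growth(g, dataset, min_sup, S):
    """Depth-first growth of g by one edge at a time."""
    if g in S:  # the same graph can be reached along several paths
        return
    S.add(g)
    for e in extensions(g, dataset):
        grown = g | {e}
        if support(grown, dataset) >= min_sup:
            pattern_growth(grown, dataset, min_sup, S)

def mine(dataset, min_sup):
    """Start one growth from every frequent single edge."""
    S = set()
    for e in set().union(*dataset):
        seed = frozenset([e])
        if support(seed, dataset) >= min_sup:
            pattern_growth(seed, dataset, min_sup, S)
    return S

DATASET = [
    graph(("A", "B"), ("B", "C"), ("C", "D")),
    graph(("A", "B"), ("B", "C")),
    graph(("B", "C"), ("C", "D")),
]
RESULT = mine(DATASET, min_sup=2)
print(len(RESULT))
```

The membership test at the top of `pattern_growth` is where the inefficiency noted in the text shows up: a pattern with k edges can be reached by growing from any of its edges, so it is rediscovered and rejected repeatedly.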

3.2  Application  of  Graph  Pattern  Mining 

•  Mining  biochemical  structures 

• Finding biologically conserved subnetworks

•  Finding  functional  modules 

•  Program  control  flow  analysis 

•  Intrusion  network  analysis 

•  Mining  communication  networks 

•  Anomaly  detection 

•  Mining  XML  structures 

• Building blocks for graph classification, clustering, compression, comparison, correlation analysis, and indexing

3.3 Graph Generation Models

•  Random  Graphs 

o     Gives  few  components  and  small  diameter 

o     Does  not  give  high  clustering  and  heavy-tailed  degree  distributions 
o     Is  the  mathematically  most  well-studied  and  understood  model 

•  Watts-Strogatz  Model 

o     Gives few components, small diameter and high clustering
o     Does not give heavy-tailed degree distributions

•  Scale-Free  Networks 

o     Gives  few  components,  small  diameter  and  heavy-tailed  distribution 
o     Does  not  give  high  clustering 

•  Hierarchical  Networks 

o     Give few components, small diameter, high clustering and heavy-tailed degree distributions
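As an illustration of how a heavy-tailed degree distribution can arise, the sketch below grows a graph by a simplified preferential-attachment process of the kind used in scale-free models: each new node links to m existing nodes chosen with probability proportional to their current degree. The function names, parameters, and seed are assumptions for the example, not a reference implementation of any specific published generator.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes picked with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = set()
    pool = []  # node v appears in the pool once per incident edge
    # Start from a small complete graph on m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.add((u, v))
            pool += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))  # degree-proportional choice
        for t in targets:
            edges.add((t, new))
            pool += [t, new]
    return edges

def degree_counts(edges):
    """Degree of every node that appears in the edge set."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

EDGES = preferential_attachment(n=200, m=2)
DEG = degree_counts(EDGES)
print("max degree:", max(DEG.values()))
print("average degree:", sum(DEG.values()) / len(DEG))
```

Early nodes tend to accumulate many links while most nodes keep a degree close to m, which is the qualitative signature of a heavy tail. The process does nothing to encourage triangles, which is consistent with the low clustering noted for scale-free networks.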

4. CHALLENGES IN CRAWLING
Crawling  the  Entire  Connected  Graph 

The primary challenge in crawling large graphs is covering the entire connected component. Because online social
network graphs are large, crawling them efficiently is important. Common algorithms for crawling graphs include
breadth-first search (BFS) and depth-first search (DFS).
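A BFS crawl can be sketched as follows. The function `fetch_neighbors` stands in for whatever call returns a user's contact list; here it is a hypothetical lookup into an in-memory dictionary, not the API of any real site, and the user identifiers are made up.

```python
from collections import deque

# Hypothetical friendship data standing in for a remote social network.
FRIENDS = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u4"],
    "u3": ["u1"],
    "u4": ["u2"],
    "u5": ["u6"],  # a separate component, unreachable from u1
    "u6": ["u5"],
}

def fetch_neighbors(user):
    """Placeholder for a network request that lists a user's contacts."""
    return FRIENDS.get(user, [])

def bfs_crawl(seed, fetch, max_nodes=None):
    """Visit users in breadth-first order, starting from the seed.

    `max_nodes` is an optional crawl budget, useful because the graph is large.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for friend in fetch(user):
            if friend in seen:
                continue
            if max_nodes is not None and len(visited) >= max_nodes:
                return visited
            seen.add(friend)
            visited.append(friend)
            queue.append(friend)
    return visited

print(bfs_crawl("u1", fetch_neighbors))
```

A crawl from a single seed covers only the seed's connected component, so users in other components (u5 and u6 in this toy data) are never reached unless additional seeds are supplied.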

Speed  of  Crawling 

Social networks are highly dynamic and change very quickly: nodes and edges keep being added to and removed from
the graph. A crawl therefore has to finish quickly if it is to capture a consistent snapshot of the network.

Type  of  Graph 

Graphs may be directed or undirected. Crawling a directed graph, as opposed to an undirected graph, presents
additional challenges. Many graphs can be crawled only by following forward links. Such a crawl does not cover the
entire graph; instead, it explores only the component reachable from the set of seed nodes.
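The effect of following only forward links can be shown on a tiny directed graph. In this sketch the dictionary `FOLLOWS` is made-up data in which an entry `u: [v]` means u links to v; the function names are illustrative.

```python
from collections import deque

# Made-up directed links: "d" points into the graph but nothing points to "d".
FOLLOWS = {"a": ["b"], "b": ["c"], "d": ["a"]}

def reachable(seeds, links):
    """Nodes reachable from the seeds by following links in their direction."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in links.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def with_reverse_links(links):
    """Treat every directed link as traversable in both directions."""
    both = {}
    for u, targets in links.items():
        both.setdefault(u, set())
        for v in targets:
            both[u].add(v)
            both.setdefault(v, set()).add(u)
    return both

print(sorted(reachable(["a"], FOLLOWS)))
print(sorted(reachable(["a"], with_reverse_links(FOLLOWS))))
```

The forward-only crawl from "a" never discovers "d", because no link leads to it. Covering such nodes requires either access to reverse links, when the site exposes them, or a larger and more varied set of seed nodes.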

5.  CONCLUSIONS 

This paper has studied the basics of graph theory, graph mining, and social networks. It has also presented some
of the characteristics of social networks. From these, it can be learned that social network graphs are built according
to the scale-free network model. The paper has also presented a case study on the analysis of the Facebook site,
measuring some of the metrics and comparing two sampling techniques.